title: ‘Visualizing Comparisons and Compositions’ output: html_document: number_sections: true toc: true toc_float: true css: !expr here::here(“global/style/style.css”) highlight: kate word_document: toc: true editor_options: markdown: wrap: 100 canonical: true chunk_output_type: inline —
Welcome to our tutorial on visualizing comparisons and compositions using {ggplot2} in R. Analyzing and comparing categories, as well as visually representing the composition of elements, are fundamental aspects of data visualization in R.
In this lesson, we’ll explore a variety of visualization techniques for these two applications!
We will start our journey by examining bar charts, a foundational element in data visualization. Bar charts offer exceptional versatility, allowing straightforward comparisons across diverse categories. However, when our focus shifts to understanding proportions or compositions, pie charts come into play. These charts prove invaluable in visually representing data where the cumulative sum of all parts constitutes a whole.
This tutorial aims to equip you with the skills to effectively visualize aggregated data and make comparisons. By the end, you’ll have a solid understanding of how to use {ggplot2} functions to visualize and compare groups in your data, enhancing your ability to draw meaningful insights from your datasets. Let’s get started!
geom_col(),
geom_errorbar(), and position
adjustments.coord_polar()
with geom_col().In this lesson we will use the following packages:
{tidyverse} for data wrangling and data
visualization
{here} for project-relative file paths
In the realm of public health, data often comes in forms that require us to compare metrics between subgroups or understand the relative contributions to a total count.
Today we’ll be looking at sub-national tuberculosis (TB) data from Benin. The data was provided by WHO and hosted on this DHIS2 dashboard. We will be looking at a subset of this data, with records of treatment outcomes for TB patients in hospitals, from 2015 to 2017.
Let’s import the tb_outcomes data subset.
# Import data from CSV
tb_outcomes <- read_csv(here::here('data/benin_tb.csv'))
# Print data frame
tb_outcomesThe data contains records of the number of new and relapse cases
started on treatment (cases). The case counts are
disaggregated by a number of groupings: time period, health facility,
treatment outcome, and diagnosis type.
Here are the detailed variable definitions for each column:
period and
period_date: records the time frame for
each entry in the dataset. The periods are marked quarterly, starting
from the first quarter of 2015 (represented as 2015Q1) up
to the last quarter of 2017 (2017Q4). This allows us to
track the progression and changes in TB cases over time.
hospital: indicates the specific
health facility where the TB cases were recorded. These facilities
represent different geographical and administrative areas, each with
unique characteristics and capabilities. This subset of the data
contains treatment outcome records from five health facilities: St Jean
De Dieu, CHPP Akron, CS Abomey-Calavi, Hopital Bethesda, Hopital
Savalou, and Hopital St Luc. This information can be used to analyze and
compare the prevalence and treatment outcomes of TB across different
facilities.
outcome: This column categorizes
the TB cases based on the diagnosis type and the stage of their
treatment journey. Each variable corresponds to a different aspect of
the patient’s diagnosis and treatment progress:
completed: started on treatment and
the respective outcome is indicated as completed.
cured: started on treatment and for
whom the respective outcome is indicated as cured (and backed by at
least two clear sputum smear results)
died: Represents the TB cases that
resulted in the death of the patient during treatment, including both
bacteriologically confirmed and clinically diagnosed cases.
failed: These are the cases where
treatment failed, which is confirmed for bacteriologically tested
patients and observed in clinically diagnosed cases.
unevaluated: These are the TB cases
that started treatment but do not have an evaluated treatment outcome
available. This applies to both bacteriologically confirmed and
clinically diagnosed cases.
diagnosis_type: This column
categorizes the TB cases based on the method of diagnosis. There are two
types of diagnosis included in this dataset:
bacteriological: These are the cases
where the presence of TB bacteria is confirmed through bacteriological
examination methods, such as sputum smear microscopy or culture
methods.clinical: These are the cases where TB
diagnosis is made based on clinical signs and symptoms, without
bacteriological confirmation. This usually happens when bacteriological
tests are either unavailable or inconclusive, and the patient presents
TB symptoms.cases: the number of new or relapse
cases started on treatment. This allows for quantitative analysis of the
TB cases, such as the total number of new cases over a specific period
or the number of cases that completed treatment in a particular health
facility.
Visualizations allow you to ask questions of your data, and the type of question you want to ask will determine the chart type. A ton of questions you might ask about your data will hinge on a comparison between values. Which region was responsible for the most cases or deaths? What year had the best treatment outcomes? These questions are answered quickly by gathering and comparing different values in your data.
One of the most common chart types are bar charts, and for good reason! They are often the most efficient and effective way of conveying counts per group or category so that comparisons can be made between the bars.
They are especially effective when the categories are ordinal (i.e., contain an inherent order; e.g., age groups), or time based (e.g., months of the year).
But when should we consider choosing a bar/column chart? Bar charts are ideally used when dealing with data that can be grouped into categories and when wanting to make comparisons between these categories.
When using {ggplot2}, we can use the geom_col() function
to create a bar plot of a categorical variable against a numerical
variable.
Let’s exemplify this by visualizing the Number of cases per
treatment outcomes in the tb_outcomes dataset:
# Basic bar plot example 1: Frequency of treatment outcomes
tb_outcomes %>%
# Pass the data to ggplot as a basis for creating the visualization
ggplot(
# Specify that the x and y axis variables
aes(x = outcome, y = cases)) +
# geom_col() creates a bar plot
geom_col() +
labs(title = "Number of cases per treatment outcome")As we can see from the graph above, geom_col() has
automatically tallied the total number of cases per outcomes, across all
periods, hospitals, and diagnosis types.
We can also change the x-axis variable to any other categorical variable in the data as below:
# Basic bar plot example 2: Case counts per hospital
tb_outcomes %>%
ggplot(aes(x = hospital, y = cases)) +
geom_col() +
labs(title = "Number of Cases per Hospital")In order to properly visualize the various categories, we can
generate a horizontal bar plot, by integrating the
coord_flip() function into our previous code.
# Basic bar plot example 3: Horizontal bars
tb_outcomes %>%
ggplot(aes(x = hospital, y = cases)) +
geom_col() +
labs(title = "Number of Cases per Hospital") +
coord_flip()We will use coord_*() functions later in the lesson to
create circular plots.
While the previous bar charts depicted case distribution across a single categorical variable, we can elevate our understanding by introducing a second categorical variable and creating stacked bar charts.
This can be done in ggplot() by setting fill color to a
categorical variable:
# Stacked bar plot:
tb_outcomes %>%
ggplot(
# Fill color of bars by the 'outcome' variable
aes(x = hospital,
y = cases,
fill = outcome)) +
geom_col()As seen above, stacked plots maintain our primary categories on the axis while visually segregating contributions from different subgroups by splitting the bars into smaller segments.
position argument to
"dodge" in geom_col() to display bars side by
side:# Grouped bar plot:
tb_outcomes %>%
ggplot(
aes(x = hospital,
y = cases,
fill = outcome)) +
# Add position argument for side-by-side bars
geom_col(position = "dodge")# Grouped bar plot: split into 2 bars
tb_outcomes %>%
ggplot(
# Fill color of bars by the 'diagnosis_type'
aes(x = hospital,
y = cases,
fill = diagnosis_type)) +
geom_col(position = "dodge")Question 1: Basic bar plot
Write the adequate code that generates a basic bar chart of the
number of cases per quarter with period_date
on the x-axis.
Question 2: Stacked bar plot
Create a stacked bar chart to display treatment outcomes over different time periods.
Visualizing data with error bars allows for a clearer understanding of the variability or uncertainty inherent in the dataset. Error bars can indicate the reliability of a mean score or an individual data point, providing context to the plotted values.
To implement error bars in {ggplot2}, we use the
geom_errorbar() function. This requires a value for the
range of your error, typically defined by the standard deviation,
standard error, or confidence intervals.
Here’s an example of how to add error bars to our grouped
geom_col() plot above.
First, let’s create the necessary summary data since we need to have some kind of error measurement. In our case, we will compute the standard deviation:
hosp_dx_error <- tb_outcomes %>%
group_by(hospital, diagnosis_type) %>%
summarise(
total_cases = sum(cases, na.rm = T),
error = sd(cases, na.rm = T))
hosp_dx_error## # A tibble: 12 × 4
## # Groups: hospital [6]
## hospital diagnosis_type total_cases error
## <chr> <chr> <dbl> <dbl>
## 1 CHPP Akron bacteriological 695 19.4
## 2 CHPP Akron clinical 180 6.42
## 3 CS Abomey-Calavi bacteriological 671 15.8
## 4 CS Abomey-Calavi clinical 120 4.05
## 5 Hopital Bethesda bacteriological 139 3.90
## 6 Hopital Bethesda clinical 117 5.77
## 7 Hopital Savalou bacteriological 70 1.70
## 8 Hopital Savalou clinical 10 0.651
## 9 Hopital St Luc bacteriological 149 3.89
## 10 Hopital St Luc clinical 19 0.765
## 11 St Jean De Dieu bacteriological 100 3.16
## 12 St Jean De Dieu clinical 71 3.16
# Recreate grouped bar chart and add error bars
hosp_dx_error %>%
ggplot(
aes(x = hospital,
y = total_cases,
fill = diagnosis_type)) +
geom_col(position = "dodge") + # Dodge the bars
# geom_errorbar() adds error bars
geom_errorbar(
# Specify upper and lower limits of the error bars
aes(ymin = total_cases - error, ymax = total_cases + error),
position = "dodge" # Dodge the error bars to align them with side-by-side bars
)With compositions, we want to show how individual parts make up the whole. We could try to answer these questions using chart types discussed above, but there are many chart types devoted to compositions that do a much better job. These part-to-whole chart types immediately focus our visual attention on the relative importance of every part to the total value in the data.
To visualize a composition, or part-to-whole relationship, we need two ingredients: the parts, and the whole.
The stacked bar charts we created earlier do a somewhat okay job of this:
# Regular stacked bar plot
tb_outcomes %>%
ggplot(
aes(x = hospital,
y = cases,
fill = outcome)) +
geom_col()This is achieved by setting the position argument to
"fill" in geom_col().
# Percent-stacked bar plot
tb_outcomes %>%
ggplot(
aes(x = hospital,
y = cases,
fill = outcome)) +
# Add position argument for normalized bars
geom_col(position = "fill")In this section, we will delve into circular data visualizations, particularly pie charts, to demonstrate categorical data distribution. These types of plots can be quite polarizing in the data visualization community due to their tendency to distort data interpretation. However, when employed judiciously, they can offer an intuitive snapshot of proportions within a dataset.
Before we can visualize the data, we must first aggregate it to get the total counts for each treatment outcome category, ensuring we have a clear representation of each segment of our dataset.
outcome_totals <- tb_outcomes %>%
group_by(outcome) %>%
summarise(
total_cases = sum(cases, na.rm = T))
outcome_totals## # A tibble: 6 × 2
## outcome total_cases
## <chr> <dbl>
## 1 completed 573
## 2 cured 1506
## 3 died 130
## 4 failed 30
## 5 lost 87
## 6 unevaluated 15
A pie chart is basically a round version of a single 100% stacked bar.
# Single-bar chart (precursor to pie chart)
ggplot(outcome_totals,
aes(x=4, # Set arbitrary x value
y=total_cases,
fill=outcome)) +
geom_col()coord_*() functions can
change a plot’s perspective, like tweaking aspect ratios or axis
limits.coord_polar(), which will shape our data into slices for a
pie chart.y aesthetic to angles (using the
theta argument), we’ll collaboratively create a visual that
clearly displays the distribution of our categorical data.# Basic pie chart
ggplot(outcome_totals,
aes(x=4,
y=total_cases,
fill=outcome)) +
geom_col() +
coord_polar(theta = "y") # Change y axis to be circular{ggplot2} lesson to improve
visual comparisons and compositions using bar charts and pie
charts.tb_outcomes dataset, we applied these
techniques to real-world public health data, emphasizing TB treatment
outcomes in Benin.geom_col() and
coord_polar().{ggplot2} tools to effectively present
your data.